Introduction

This notebook serves as a reporting tool for the CPSC. In it, I lay out the questions the CPSC is interested in answering with data from their SaferProducts API. The format is as follows: a few questions are presented, each with a findings section giving a quick summary, while Section 4 provides further detail on how the findings were produced.

Analysis

Because the API was down at the time of this report, I obtained the data from Ana Carolina Areias via a Dropbox link. I cleaned up the raw JSON and converted it into a dataframe (the cleaning code can be found in exploratory.ipynb in the /notebook directory). I then saved the data with pickle so it can be easily loaded for analysis.
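A minimal sketch of that load-and-pickle step, assuming the Dropbox file is a plain JSON list of incident records (the raw file name and path are illustrative; the full cleaning lives in exploratory.ipynb):

import json
import pickle

import pandas as pd

# Load the raw JSON obtained via the Dropbox link (path is illustrative)
with open('/home/datauser/cpsc/data/raw/saferproducts.json') as f:
    raw = json.load(f)

# Turn the list of incident records into a dataframe; the real cleaning in
# exploratory.ipynb also flattens nested fields and parses the /Date(...)/ stamps
data = pd.DataFrame(raw)

# Persist the cleaned dataframe so the analysis can load it quickly with pickle
with open('/home/datauser/cpsc/data/processed/cleaned_api_data', 'wb') as f:
    pickle.dump(data, f)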

The questions answered here are the result of a conversation between DKDC and the CPSC regarding their priorities and what information is available from the data.

The main takeaways from this analysis are:

  • From the self-reported statistics of people who reported injuries through the API, the reporting population appears to skew away from older people; the bulk of reporters are 40-60 years old.
  • An overwhelming majority of reports did not involve bodily harm or require medical attention; most were simply incident reports about a particular product.
  • Among the reports that did involve harm, the most frequently reported product category was footwear, largely pain and discomfort while walking in Skechers Tone-Ups shoes.
  • Although not conclusive, the reports suggest a notable number of fire-related incidents, based on a cursory examination of the most frequent words.

In [104]:
import pickle
import operator

import numpy as np
import pandas as pd 
import gensim.models

In [3]:
data = pickle.load(open('/home/datauser/cpsc/data/processed/cleaned_api_data', 'rb'))
data.head()


Out[3]:
AnswerExplanation CompanyCommentsExpanded IncidentDate IncidentDescription IncidentProductDescription IncidentReportDate IncidentReportId IncidentReportNumber IncidentReportPublicationDate LocaleId ... __metadata LocaleDescription LocalePublicName GenderDescription GenderId GenderPublicName ProductCategoryDescription ProductCategoryPublicName SeverityTypeDescription SeverityTypePublicName
0 None Helen of Troy acknowledges receipt of the subm... /Date(1297036800000)/ Using the Revlon rv050 curling iron and came h... rv050 curling iron /Date(1299801600000)/ 1170172 20110311-B3E19-2147481666 /Date(1301709487243)/ -1 ... {u'type': u'CPSRMS_PUBModel.IncidentDetail', u... Unspecified Unspecified Missing Missing Missing Hair Curlers, Curling Irons, Clips & Hairpins Hair Curlers, Curling Irons, Clips & Hairpins Missing Missing
1 None None /Date(881884800000)/ On December 12th 1997, I found my son, Tyler J... ChildCraft drop side crib, oak /Date(1299801600000)/ 1170340 20110311-CFE0F-2147481661 /Date(1301674156573)/ 1 ... {u'type': u'CPSRMS_PUBModel.IncidentDetail', u... Home/Apartment/Condominium Home/Apartment/Condominium Male 2 Male Cribs Cribs Death Death
2 I have contacted the manufacturer several time... None /Date(1294876800000)/ I have a Frigidaire electric range that comes ... Electric Smoothtop Range /Date(1299801600000)/ 1170342 20110311-CFAB7-2147481658 /Date(1302051235430)/ -1 ... {u'type': u'CPSRMS_PUBModel.IncidentDetail', u... Unspecified Unspecified Missing Missing Missing Electric Ranges or Ovens (Excl Counter-top Ovens) Electric Ranges or Ovens (Excl Counter-top Ovens) Missing Missing
3 Sears wants $500 to fix this product. It is a... Sears Holdings takes product safety issues ver... /Date(1299801600000)/ Kenmore Elite Model #795.77543600\r\n\r\nThe l... Kenmore Elite Trio 25Cubic Feet Bottom Freezer... /Date(1299801600000)/ 1170344 20110311-EFD8B-2147481655 /Date(1301709795607)/ 1 ... {u'type': u'CPSRMS_PUBModel.IncidentDetail', u... Home/Apartment/Condominium Home/Apartment/Condominium Male 2 Male Refrigerators Refrigerators Received care that did not involve medical per... First Aid Received by Non-Medical Professional
4 I will be writing a letter to the company foll... Thank you for sharing your experience with us ... /Date(1299456000000)/ Since he was born two months ago, we have been... Pampers Swaddlers New Baby with Dry Max, Size 1-2 /Date(1299801600000)/ 1170347 20110311-DBB63-2147481650 /Date(1301661238360)/ 1 ... {u'type': u'CPSRMS_PUBModel.IncidentDetail', u... Home/Apartment/Condominium Home/Apartment/Condominium Male 2 Male Diapers Diapers Received care that did not involve medical per... First Aid Received by Non-Medical Professional

5 rows × 42 columns

Are there certain populations we're not getting reports from?

We can create a basic cross tab between age and gender to see if any patterns emerge.


In [67]:
pd.crosstab(data['GenderDescription'], data['age_range'])


Out[67]:
age_range under 10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100 over 100
GenderDescription
Female 1339 289 613 1179 1523 1485 899 240 71 7 1
Male 1347 324 450 902 1213 1328 1030 271 49 4 1
Missing 0 0 0 0 0 0 0 0 0 0 0
Unknown 23 0 1 1 6 9 5 2 1 0 0
Unspecified 50 0 10 14 25 26 21 11 1 1 0

From the data, it seems there is not much underrepresentation by gender: there are only around a thousand fewer reports from males than from females in a dataset of roughly 28,000. Age appears to be a bigger issue, with older people underrepresented among those using the API. Older folks may be less likely to self-report, or, even if they wanted to, may not be comfortable enough with a web interface to do so. My assumption is that people over 70 probably experience product harm at a higher rate than they report it.

If we wanted to raise awareness about a certain tool or item, where should we focus our efforts?

To answer this, I removed any incidents that did not cause bodily harm and took the most frequently reported categories that remained. The API records several levels of severity, so complaints that did not involve physical harm can be filtered out. After removing those complaints, it is striking to see that "Footwear" is the top product category for harm.


In [80]:
#removing minor harm incidents
no_injuries = ['Incident, No Injury', 'Unspecified', 'Level of care not known',
               'No Incident, No Injury', 'No First Aid or Medical Attention Received']
damage = data.ix[~data['SeverityTypePublicName'].isin(no_injuries), :]
damage.ProductCategoryPublicName.value_counts()[0:9]


Out[80]:
Footwear                                                    774
Computers (Equipment and Electronic Games)                  274
Diapers                                                     156
Electric Ranges or Ovens (Excl Counter-top Ovens)           134
Bicycles and Accessories, (Excl.mountain or All-terrain)    108
Baby Strollers                                              108
Electric Coffee Makers or Teapots                           100
Cribs                                                        94
Bassinets or Cradles                                         88
Name: ProductCategoryPublicName, dtype: int64

This is actually perplexing, so I decided to investigate further by analyzing the complaints filed for the "Footwear" category. To do this, I trained a Word2Vec model, a shallow neural network for text analysis that maps each word and its linguistic context into a vector space so that similarity between words can be computed. The purpose is to find words that relate to each other: rather than doing a simple cross tab of product categories, I can ingest the complaint text and map out the relationships within it. For instance, using the complaints that resulted in bodily harm, I found that footwear was associated with pain and walking. The injuries appear to be related specifically to Skechers sneakers, since that was the only brand that showed up often enough to be included in the model's vocabulary. In fact, there was a lawsuit regarding Skechers and their toning shoes.
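A minimal sketch of how such a model might be trained on the harmful footwear complaints (the simple lowercase tokenizer and the training parameters are assumptions; the saved models queried later were produced by a similar process):

import re

import gensim.models

# Keep the incident descriptions for footwear complaints that involved harm
footwear = damage.ix[damage['ProductCategoryPublicName'] == 'Footwear',
                     'IncidentDescription']

# Very simple tokenizer: lowercase and keep runs of letters (an assumption,
# not necessarily the preprocessing used for the saved models)
sentences = [re.findall(r'[a-z]+', text.lower()) for text in footwear.dropna()]

# Train a small Word2Vec model on the complaint text and save it
footwear_model = gensim.models.Word2Vec(sentences, size=100, window=5,
                                        min_count=5, workers=2)
footwear_model.save('/home/datauser/cpsc/models/footwear')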

Are there certain complaints that people are filing? Quality issues vs injuries?

Looking below, we see that the vast majority of reports are incidents without any bodily harm. Over 60% of all complaints were categorized as "Incident, No Injury".


In [81]:
data.SeverityTypeDescription.value_counts()


Out[81]:
Incident, No Injury                                                                                                                                                                                                          17916
No First Aid or Medical Attention Received                                                                                                                                                                                    2672
Received care that did not involve medical personnel (doctor, nurse, Emergency Medical Technician (EMT), etc.)                                                                                                                2177
Treated by medical personnel (doctor, nurse, etc.) in any setting except a hospital emergency department.  Includes both medical (doctor's office, clinic, etc.) and non-medical (school, accident scene, etc.) settings.     1470
Unspecified                                                                                                                                                                                                                   1330
Treated and released from a hospital emergency department                                                                                                                                                                     1090
Level of care not known                                                                                                                                                                                                        715
Admitted for hospitalization                                                                                                                                                                                                   456
No Incident, No Injury                                                                                                                                                                                                         224
Missing                                                                                                                                                                                                                        135
Death                                                                                                                                                                                                                           93
Name: SeverityTypeDescription, dtype: int64

Although a report is labeled as having no injury, that does not necessarily mean we cannot take precautions. I took the same approach as with the previous model: I subsetted the data to only the complaints labeled "no injury" and trained a model to examine the words used. From that analysis, the words "to", "was", and "it" were the three most frequent. At first glance these words may seem meaningless, but if we examine the words most similar to them, we can start to see a connection.

For instance, the words most closely related to "to" were "unable" and "trying", which convey a sense of urgency in attempting to turn something on or off. Examining the word "unable", I saw it was related to words such as "attempted" and "disconnect." Further investigation led me to find it was dealing with a switch or a plug, possibly an electrical item.

A similar picture is painted when examining the word "was." The words that felt out of place were "emitting", "extinguish", and "smelled." It is no surprise that after a few more queries around these words, words like "sparks" and "smoke" started popping up. This leads me to believe that many of these complaints involve encounters closely related to fire.

So while these complaints may only describe close encounters with danger, it may be worthwhile to review them further with an eye out for fire-related injuries or products that could cause a fire.
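A rough sketch of the drill-down used above: start from a frequent seed word and repeatedly query the model's neighbours, collecting the terms that surface (the helper is mine, not part of the analysis code; it assumes the severity model queried in the cell below):

def drill_down(model, seed, depth=2, topn=10):
    # Breadth-first walk over most_similar neighbours: the neighbours of the
    # seed word, then the neighbours of those neighbours, and so on
    seen = set([seed])
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for word in frontier:
            for neighbour, _score in model.most_similar(word, topn=topn):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen

# e.g. drill_down(model, 'was') would include the terms seen in the output
# below ('emitting', 'extinguish', 'smelled') plus their own neighbours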


In [121]:
model.most_similar('was')


Out[121]:
[('came', 0.524124026298523),
 ('emitting', 0.5203149318695068),
 ('examined', 0.4794829785823822),
 ('smelled', 0.4679615795612335),
 ('immediately', 0.46619912981987),
 ('arrived', 0.4562903642654419),
 ('extinguish', 0.45025503635406494),
 ('sounded', 0.4436829388141632),
 ('next', 0.4382171630859375),
 ('mins', 0.43233776092529297)]

Who are the people who are actually reporting to us?

This question is difficult to answer because of the lack of data on the reporter. From the cross tabulation in Section 3.1, we see that the largest share of respondents are female and the largest age groups are 40-60. That is probably the best picture we have of who is using the API.

Conclusion

This is meant to serve as a starting point for examining the API data. The main findings were that:

  • From the self-reported statistics of people who reported injuries through the API, the reporting population appears to skew away from older people; the bulk of reporters are 40-60 years old.
  • An overwhelming majority of reports did not involve bodily harm or require medical attention; most were simply incident reports about a particular product.
  • Among the reports that did involve harm, the most frequently reported product category was footwear, largely pain and discomfort while walking in Skechers Tone-Ups shoes.
  • Although not conclusive, the reports suggest a notable number of fire-related incidents, based on a cursory examination of the most frequent words.

While text analysis is helpful, it is often not sufficient. What would really help the analysis process is more information from the user. The following information would be helpful to collect in order to produce more actionable insights.

  • Ethnicity/Race
  • Self Reported Income
  • Geographic information
    • Region (Mid Atlantic, New England, etc)
    • Closest Metropolitan Area
    • State
    • City
  • Geolocation of IP address
    • coordinates can be "jittered" to conserve anonymity

A great next step would be a deeper text analysis on shoes. It may be possible to train the Word2Vec model over smaller windows of words so we can capture local context better (see the sketch below). With more time, I would also fix the unicode issues with some of the complaints (special characters prevented some complaints from being converted into strings), and I would look further into the category with the most complaints overall, "Electric Ranges or Ovens", to see what those complaints were about.
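One way to read "smaller windows of words" is a smaller context window at training time. A minimal sketch with gensim, reusing the tokenized complaints (sentences) from the earlier footwear sketch, just with the window tightened:

import gensim.models

# A tighter context window (window=2 instead of 5) means only words appearing
# very close together are treated as related, sharpening local context
narrow_model = gensim.models.Word2Vec(sentences, size=100, window=2,
                                      min_count=5, workers=2)
narrow_model.most_similar('walking')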

If we could implement these suggestions, there is no doubt we could gain valuable insights into products that are harming Americans. This report serves as the first step. I would like to thank the CPSC for this data set and DKDC for the opportunity to conduct this analysis.

References

Question 2.1

The data we worked with had limited information about the victim's demographics beyond age and gender. However, that was enough to draw some basic inferences. Below we get counts by gender, of which a plurality are female.

Age is a bit trickier: we have the victim's age in months. I converted it into years and broke it down into 10-year age ranges so we can better examine the data.


In [4]:
data.GenderDescription.value_counts()


Out[4]:
Female         10629
Male            9489
Unspecified     6068
Unknown         1957
Missing          135
Name: GenderDescription, dtype: int64

In [46]:
# Convert the victim's age from months to years (Python 2 integer division)
data['age'] = map(lambda x: x/12, data['VictimAgeInMonths'])
labels = ['under 10', '10-20', '20-30', '30-40', '40-50', '50-60',
          '60-70', '70-80', '80-90', '90-100', 'over 100']
# Bucket the ages into 10-year ranges, then assign anyone above 100 to the
# top bucket explicitly
data['age_range'] = pd.cut(data['age'], bins=np.arange(0, 120, 10), labels=labels)
data['age_range'][data['age'] > 100] = 'over 100'

In [66]:
counts = data['age_range'].value_counts()
counts.sort_values(inplace=True)
counts


Out[66]:
over 100       2
90-100        12
80-90        122
70-80        524
10-20        613
20-30       1074
60-70       1955
30-40       2096
under 10    2759
40-50       2767
50-60       2848
Name: age_range, dtype: int64

However, after doing this, we still have around 13,000 records with an age of zero. Whether the age was simply not filled in or the incident actually involved an infant is unknown, but comparing the distribution of products affecting people with an age of 0 against the overall dataset, it appears that the null values in the age range mostly represent people who did not fill out an age when reporting.


In [58]:
# Top products reported by people with an age of 0
data.ix[data['age_range'].isnull(), 'ProductCategoryPublicName'].value_counts()[0:9]


Out[58]:
Electric Ranges or Ovens (Excl Counter-top Ovens)    1713
Dishwashers                                          1089
Microwave Ovens                                       691
Refrigerators                                         519
Gas Ranges or Ovens                                   493
Electric Coffee Makers or Teapots                     317
Ranges or Ovens, Not Specified                        279
Light Bulbs                                           267
Washing Machines, Other or Not Specified              249
Name: ProductCategoryPublicName, dtype: int64

In [59]:
#top products that affect people overall
data.ProductCategoryPublicName.value_counts()[0:9]


Out[59]:
Electric Ranges or Ovens (Excl Counter-top Ovens)    2704
Dishwashers                                          1605
Microwave Ovens                                      1095
Footwear                                              949
Refrigerators                                         888
Gas Ranges or Ovens                                   872
Computers (Equipment and Electronic Games)            838
Electric Coffee Makers or Teapots                     748
Nonmetal Cookware (Nonelectric)                       530
Name: ProductCategoryPublicName, dtype: int64

Question 2.2

At first glance, we can simply look at the products that were reported, as below, and see that Electric Ranges or Ovens is at the top. However, there are levels of severity within the API that need to be filtered out before we can assess which products cause the most harm.


In [70]:
#overall products listed
data.ProductCategoryPublicName.value_counts()[0:9]


Out[70]:
Electric Ranges or Ovens (Excl Counter-top Ovens)    2704
Dishwashers                                          1605
Microwave Ovens                                      1095
Footwear                                              949
Refrigerators                                         888
Gas Ranges or Ovens                                   872
Computers (Equipment and Electronic Games)            838
Electric Coffee Makers or Teapots                     748
Nonmetal Cookware (Nonelectric)                       530
Name: ProductCategoryPublicName, dtype: int64

In [73]:
#removing minor harm incidents
no_injuries = ['Incident, No Injury', 'Unspecified', 'Level of care not known',
               'No Incident, No Injury', 'No First Aid or Medical Attention Received']
damage = data.ix[~data['SeverityTypePublicName'].isin(no_injuries), :]
damage.ProductCategoryPublicName.value_counts()[0:9]


Out[73]:
Footwear                                                    774
Computers (Equipment and Electronic Games)                  274
Diapers                                                     156
Electric Ranges or Ovens (Excl Counter-top Ovens)           134
Bicycles and Accessories, (Excl.mountain or All-terrain)    108
Baby Strollers                                              108
Electric Coffee Makers or Teapots                           100
Cribs                                                        94
Bassinets or Cradles                                         88
Name: ProductCategoryPublicName, dtype: int64

This shows that, once we keep only incidents where an injury actually occurred and some level of care was received, footwear tops the list, which was unexpected. To explore this, I created a Word2Vec model that maps out how words relate to each other, training it on the complaint text from the API. The model helps identify words that are used in similar contexts: for instance, querying "foot" returns "left" and "right" as the most closely related words. After some digging around, I found that the word "walking" was associated with "painful". I have some reason to believe there are orthopedic injuries associated with these shoes: people have been experiencing pain while walking in Skechers shoes that were supposed to tone their bodies, along with some instability or balance issues.


In [115]:
model = gensim.models.Word2Vec.load('/home/datauser/cpsc/models/footwear')
model.most_similar('walking')


Out[115]:
[('while', 0.9986404180526733),
 ('up', 0.9979861378669739),
 ('skecher', 0.9972473978996277),
 ('toning', 0.9970976114273071),
 ('suffered', 0.9960523843765259),
 ('sketchers', 0.9956487417221069),
 ('bought', 0.9945806264877319),
 ('instability', 0.9944456815719604),
 ('wore', 0.9942538142204285),
 ('fell', 0.9942276477813721)]

In [84]:
model.most_similar('injury')


Out[84]:
[('of', 0.9983473420143127),
 ('bottom', 0.9975317716598511),
 ('shoe', 0.9967656135559082),
 ('began', 0.9965482950210571),
 ('sneakers', 0.9964351654052734),
 ('stairs', 0.9962997436523438),
 ('balance', 0.9961462020874023),
 ('ago', 0.9960271716117859),
 ('last', 0.9959736466407776),
 ('new', 0.9958116412162781)]

In [94]:
model.most_similar('instability')


Out[94]:
[('due', 0.9992794990539551),
 ('suffered', 0.9977741241455078),
 ('while', 0.9953591823577881),
 ('up', 0.9946755170822144),
 ('walking', 0.9944456815719604),
 ('sketchers', 0.994174599647522),
 ('skecher', 0.9933757781982422),
 ('toning', 0.9927452802658081),
 ('sketcher', 0.9924877882003784),
 ('of', 0.9900130033493042)]

Question 2.3


In [122]:
model = gensim.models.Word2Vec.load('/home/datauser/cpsc/models/severity')
# Count how often each word in the model's vocabulary appears in the
# "no injury" complaints, then list the five most frequent words
items_dict = {}
for word, vocab_obj in model.vocab.items():
    items_dict[word] = vocab_obj.count
sorted_dict = sorted(items_dict.items(), key=operator.itemgetter(1))
sorted_dict.reverse()
sorted_dict[0:5]


Out[122]:
[('to', 55216), ('was', 37765), ('it', 35055), ('is', 21825), ('not', 19165)]